The Company They Keep: Extracting Japanese Neologisms Using Language Patterns

Authors

  • James Breen
  • Timothy Baldwin
Abstract

We describe an investigation into the identification and extraction of unrecorded potential lexical items in Japanese text by detecting text passages containing selected language patterns typically associated with such items. We identified a set of suitable patterns, then tested them with two large collections of text drawn from the WWW and Twitter. Samples of the extracted items were evaluated, and it was demonstrated that the approach has considerable potential for identifying terms for later lexicographic analysis.
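As a rough illustration of the pattern-based idea described in the abstract, the following is a minimal sketch, not the authors' implementation. The cue patterns (という, と呼ばれる), the character ranges used to delimit a candidate, the function name `extract_candidates`, and the `known_words` set standing in for a lexicon are all assumptions made for this example; the abstract does not list the patterns or the lexicon that were actually used.

```python
import re

# Hypothetical cue patterns that often follow a term being introduced or named.
# These are illustrative assumptions, not the pattern set used in the paper.
CUE_PATTERNS = ["という", "と呼ばれる"]

# A candidate is taken to be a run of two or more katakana/kanji characters
# immediately before a cue pattern (a simplifying assumption).
CANDIDATE = re.compile(
    r"([\u30A0-\u30FF\u4E00-\u9FFF]{2,})(?:"
    + "|".join(re.escape(p) for p in CUE_PATTERNS)
    + r")"
)


def extract_candidates(text, known_words):
    """Return terms that precede a cue pattern and are not in known_words."""
    found = []
    for match in CANDIDATE.finditer(text):
        term = match.group(1)
        # Keep only unrecorded terms, preserving first-seen order.
        if term not in known_words and term not in found:
            found.append(term)
    return found


if __name__ == "__main__":
    sample = "これはググラビリティと呼ばれる概念だ。辞書という本もある。"
    lexicon = {"辞書"}  # stand-in for a real dictionary lookup
    print(extract_candidates(sample, lexicon))
```

In this sketch the lexicon check is what separates unrecorded candidates from ordinary words that happen to precede a cue; candidates found this way would still need the later lexicographic analysis the abstract mentions.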


Similar resources

Automated Extraction of Swedish Neologisms using a Temporally

This thesis presents an automated system for extracting neologisms using machine learning approaches. The neologisms are extracted from a large temporally annotated corpus containing newspaper articles and blog posts. We find that our system is different from much of the previous research on neologism extraction and justify these differences by relating it to current research in evolutionary li...


Mining and Classification of Neologisms in Persian Blogs

The exponential growth of the Persian blogosphere and the increased number of neologisms create a major challenge in NLP applications of Persian blogs. This paper describes a method for extracting and classifying newly constructed words and borrowings from Persian blog posts. The analysis of the occurrence of neologisms across five distinct topic categories points to a correspondence between th...


Reference Resolution Using Semantic Patterns In Japanese Newspaper Articles

Reference resolution is one of the important tasks in natural language processing. In Japanese newspaper articles, pronouns are not often used as referential expressions for company names, but shortened company names and dousha (“the same company”) are used more often (Muraki et al. 1993). Although there have been studies of reference resolution for various noun phrases in Japanese (Shibata et ...


Identification of Neologisms in Japanese by Corpus Analysis

In Japanese and other languages that do not use spaces or other markers between words, the identification and extraction of neologisms and other unrecorded words presents some particular challenges. In this paper we discuss the problems encountered with neologism identification and describe and discuss some of the methods that have been employed to overcome these problems.


Socio-cultural Patterns in Iranian High School Textbooks from the View point of Motivation for Research

Introduction One very important aspect of any textbook is its content in terms of the motivation it creates in the readers. This is specifically true in EFL textbooks where the learners need more than just content since content-wise, such books are not very much different from the learners’ world knowledge level. That is why material developers working in this area are usually consciously choos...



Publication date: 2017